IAES International Journal of Artificial Intelligence (IJ-AI) 


Vol. 9, No. 2, June 2020, pp. 212~220 


ISSN: 2252-8938, DOI: 10.11591/ijai.v9.i2.pp212-220 


Sentiment analysis of informal Malay tweets with deep learning 


Ong Jun Ying1, Muhammad Mun’im Ahmad Zabidi2, Norhafizah Ramli3, Usman Ullah Sheikh4 
1Department of Device Development Group, Intel Penang, Malaysia 
2,3,4School of Electrical Engineering, Faculty of Engineering, Universiti Teknologi Malaysia, Malaysia 


Article Info 


ABSTRACT 


Article history: 


Received Jan 16, 2020 
Revised Mar 6, 2020 
Accepted Apr 17, 2020 


Keywords: 


Bahasa Indonesia 
Convolutional neural network 
Malay 

Natural language processing 


Twitter is an online microblogging and social-networking platform which 
allows users to write short messages called tweets. It has over 330 million 
registered users generating nearly 250 million tweets per day. As Malay is 
the national language in Malaysia, there is a significant number of users 
tweeting in Malay. Tweets have a maximum length of 140 characters which 
forces users to stay focused on the message they wish to disseminate. 
This characteristic makes tweets an interesting subject for sentiment analysis. 
Sentiment analysis is a natural language processing (NLP) task of classifying 
whether a tweet has a positive or negative sentiment. Tweets in Malay are 
chosen in this study as limited research has been done on this language. 
In this work, sentiment analysis is applied to Malay tweets using a deep 
learning model. We achieved 77.59% accuracy, which exceeds similar work 
done on Bahasa Indonesia. 


1. INTRODUCTION 

The advancement of the Internet has confronted the world with ever-growing volumes of data 
in various forms. Applications such as social media, review sites, forums, and blogs generate 
enormous amounts of data in the form of sentiments, opinions, views, and arguments about social 
events, politics, products and more. The sentiments expressed by users have a great influence on 
readers, product vendors, and even politicians. As a result, many companies depend on Internet users’ 
feedback and opinions to market their products and services on social media such as Twitter, Facebook, 
Snapchat, and Instagram. Companies are interested in knowing what users think about their services or 
products. These changes in business models have created a huge business opportunity for the sentiment 
analysis of such data [1]. Malay is Malaysia's national language, and a significant number of Malaysian 
users express their opinions and arguments on social media in Malay. There is therefore a huge business 
opportunity for building a Malay sentiment analysis model. However, very limited research has been 
devoted to Malay sentiment analysis [2]. 

Data on social media are mostly unstructured. The large amounts of unstructured data make it very 
difficult for a human to extract and summarize the opinions contained therein. Human errors often occur, 
reducing the accuracy of the data analysis. Automated sentiment analysis greatly reduces the 
human workload of analyzing the data [3-4]. To exploit this opportunity, many start-ups now 
provide sentiment analysis services to the public. Likewise, many big corporations are building their 
in-house capabilities in sentiment analysis. These practical applications and industrial interests provide 
a strong motivation for research in sentiment analysis. 


2. LITERATURE REVIEW 
2.1. Natural language processing 

Natural Language Processing (NLP) is a field at the intersection of artificial intelligence, computer science, and 
linguistics. It refers to the interaction between machines and humans in natural language [5]. The ultimate 
goal of NLP is to enable computers to understand language as well as humans do. NLP is divided into three 
categories: 
— Natural Language Understanding: Computers’ ability to understand what we say. 
— Natural Language Generation: The generation of natural language by a computer. 
— Speech Recognition: The translation of spoken language into text. 

NLP has been widely applied in applications such as predictive typing, sentiment analysis, 
spell checking, and spam detection. 


2.2. Sentiment analysis 

Sentiment analysis (SA) is also known as opinion mining [6]. It refers to building a system that identifies 
and extracts opinions from text. Such a system is often able to extract the attributes of an expression, which include: 
— Subject: Topic that is talked about. 
— Polarity: Positive or negative opinion. 
— Opinion holder: Entity that expresses an opinion. 

Sentiment analysis generates great interest mainly due to having many practical applications. 
The constant expansion of the Internet produces large volumes of textual data expressing opinions. 
With sentiment analysis, this messy and unstructured information will be automatically transformed into 
structured data. These data contain the public opinion about services, products, brands and different topics 
that people could express opinions about. The data is very valuable for commercial applications like 
marketing analysis, product feedback, customer service, and public relations. In short, the SA 
system helps companies to make sense of this sea of unstructured text by automating business processes, 
getting actionable insights, and saving time on manual data processing. This research focuses solely on the 
polarity extraction aspect of SA. 

There are several approaches to implementing a sentiment analysis system. As shown in 
Figure 1, these approaches are often grouped into two major categories: machine learning 
approaches and lexicon-based approaches [7]. 
— Machine Learning Approaches: Learn the features from labelled data, then apply them to a set of unknown 
data. 
— Lexicon-Based Approaches: Work on a dictionary listing words/sentences with different 
polarities; the output polarity is the sum of the polarities of the individual words/sentences [8]. 
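
As a rough illustration of the lexicon-based idea, the sketch below scores a text by adding up the polarities of the words it contains. The tiny Malay lexicon and the function name `lexicon_polarity` are assumptions made for this example only; a real system would rely on a curated sentiment dictionary.

```python
# Hypothetical toy lexicon (assumption): +1 for positive words, -1 for negative words.
LEXICON = {"bagus": 1, "suka": 1, "cantik": 1, "teruk": -1, "benci": -1, "lambat": -1}


def lexicon_polarity(text, lexicon=LEXICON):
    """Sum the polarity of every known word and map the total to a label."""
    score = sum(lexicon.get(token, 0) for token in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("servis bagus dan saya suka"))
print(lexicon_polarity("internet teruk dan lambat"))
```

Words that are missing from the lexicon contribute nothing to the score, which is one reason lexicon-based methods can struggle with informal spelling.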


Figure 1. Sentiment classification techniques 


Machine learning approaches rely on algorithms to automatically extract features from a data set. 
After a model has been sufficiently trained, it is fed unknown text and returns the 
corresponding polarity (positive, negative, or neutral). In NLP, sentiment analysis is usually categorized 
as a text classification task, so a sentiment analysis model is a type of classifier. 


2.3. Data pre-processing 

Data pre-processing refers to the steps applied to raw text before it can be transformed into 
numerical features that a machine learning algorithm can work with [9]. Raw text comes in a 
variety of forms, from individual words to sentences and multiple paragraphs, often with special 
characters and a lot of redundant information. 

All raw text needs to be cleaned up and trimmed down to a root format that works well with 
the machine learning algorithm. Good text pre-processing improves both classification accuracy and 
training efficiency [10]. The text pre-processing techniques that are often used are listed below: 

— Tokenization: Create a token for each word in the sentence/document. 
— Capitalization: Change all words to lower case or upper case. 
— Abbreviation replacement: Expand abbreviations to the original words. 
— Stop word removal: Remove words that carry little meaning. For example: 'is', 'the', 'are'. 
— Stemming: Remove the endings of a word. For example: changing 'studies' to 'study' and 
'studying' to 'study'. 
— Lemmatization: Group the different inflected forms of a word. For example: changing 
'studies' and 'studying' to 'study'. 
— Spelling correction: Recover words from typos. 
— Part-of-speech tagging: Tag each token with its word type. For example: nouns, verbs, adjectives. 
— Number removal: Remove numerical expressions. 
— Named entity recognition: Label words with their respective entities. For example: name, location, date, 
address. 

Not all pre-processing techniques are suitable for every scenario. The techniques are chosen on a 
case-by-case basis. 


2.4. Text to numerical representation 

Computers, and therefore machine learning algorithms, only process data in numerical form. 
Before the data is passed to the machine learning algorithm, it must be converted 
from textual format to numerical format. This process is also known as text vectorization. The common 
techniques for text vectorization are: 

— Bag of Words (BoW): BoW is a basic and straightforward technique. With this method, the order and 
grammar of the words are discarded. It only tells whether a word is present in the document or not. Each 
word in the entire data set corresponds to a column; if a particular word exists in the input sentence, 
the vector representation of this sentence has a 1 in the corresponding column for this word [11]. 

— Term Frequency-Inverse Document Frequency (TF-IDF): Each word within an input sentence is 
replaced with its TF-IDF score, and a vector is created out of these scores for each input sentence. The 
idea of this measure is to give more importance to terms that occur often in a given sentence (TF) 
and to reduce the importance of terms that are very frequent in the entire corpus (IDF) [12]. 

— Word Embeddings: A way of statistically extracting the meaning of a word from text and 
representing it with a set of numbers. To capture the meaning of a word, these models use 
contextual similarities. With this technique, words with similar meanings tend to have similar 
representations. The word2vec method is commonly used for word embeddings [13]. 
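
To make the first two vectorization techniques concrete, the following sketch builds BoW and TF-IDF vectors for a three-sentence toy corpus using only the Python standard library. The corpus, the helper names, and the unsmoothed `log(N / df)` form of IDF are assumptions chosen for illustration; libraries often use smoothed variants.

```python
import math

# Hypothetical toy corpus (assumption), already lower-cased and whitespace-tokenizable.
docs = [
    "saya suka makanan ini",
    "saya tidak suka servis ini",
    "makanan sedap",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})  # one column per word


def bow_vector(tokens):
    """Binary Bag of Words: 1 if the vocabulary word is present, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]


def tfidf_vector(tokens):
    """TF-IDF with tf = count / length and idf = log(N / document frequency)."""
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = tokens.count(w) / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)
        vec.append(tf * math.log(n_docs / df))
    return vec


print(vocab)
print(bow_vector(tokenized[0]))
print([round(x, 3) for x in tfidf_vector(tokenized[2])])
```

In the last sentence, "sedap" appears in only one document and therefore receives a higher TF-IDF weight than "makanan", which appears in two.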


2.5. Classification algorithms 
Many classification algorithms exist in machine learning. For sentiment analysis, statistical models 
like Support Vector Machine (SVM), Naive Bayes (NB) or neural networks are commonly used. 
— Naive Bayes (NB): Uses Bayes' theorem (a probabilistic algorithm) to predict the category 
of a text [14]. 
— Support Vector Machine (SVM): Text is represented as a point in a multidimensional space, and the points 
are mapped to different categories. New text is mapped onto the same space and predicted to belong 
to a category based on the region in which it lies [15]. 
— Artificial Neural Networks (ANN) and Deep Learning (DL): Use a diverse set of algorithms that imitate 
how the human brain works by employing artificial neural networks to process data [16]. 
A Convolutional Neural Network (CNN), such as the one shown in Figure 2, is a type of ANN that is 
commonly used in image recognition. It was initially designed to process pixel data, but recently CNNs 
have also been applied to text classification tasks and have achieved very good performance [17-19]. 
The architecture of a CNN is much like a multi-layer perceptron that has been designed for 
reduced processing requirements. In general, a CNN consists of an input layer, hidden layers, and an output 
layer. The hidden layers include multiple convolutional layers, normalization layers, pooling layers and fully 
connected layers [16]. A CNN has forward and reverse transmission. In forward transmission, the input data 
passes through multiple layers of processing and produces a result at the output layer; an activation 
function is needed in every layer. In reverse transmission, the error is calculated from the expected 
result and the forward transmission result, and this error is propagated back to each respective 
layer. Lastly, the gradient descent technique is applied to tune the bias parameters and network weights to 
obtain better accuracy. 
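
The forward and reverse transmission described above can be illustrated on the smallest possible case: a single softmax output layer trained with one gradient descent step. This is a simplified sketch, not a CNN; the input vector, the learning rate, and all function names are assumptions for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, bias, x):
    """Forward transmission: weighted sums followed by softmax."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(scores)


def cross_entropy(probs, target):
    """Error between the predicted distribution and the true class index."""
    return -math.log(probs[target])


def sgd_step(weights, bias, x, target, lr=0.1):
    """Reverse transmission: for softmax with cross-entropy, d(loss)/d(score_k) = p_k - 1[k == target]."""
    probs = forward(weights, bias, x)
    grads = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    new_w = [[w - lr * g * xi for w, xi in zip(row, x)] for row, g in zip(weights, grads)]
    new_b = [b - lr * g for b, g in zip(bias, grads)]
    return new_w, new_b


x = [0.5, -1.0, 2.0]          # hypothetical feature vector
weights = [[0.0] * 3, [0.0] * 3]
bias = [0.0, 0.0]
target = 1                    # index of the true class

before = cross_entropy(forward(weights, bias, x), target)
weights, bias = sgd_step(weights, bias, x, target)
after = cross_entropy(forward(weights, bias, x), target)
print(before, after)
```

The loss after the update is lower than before it, which is the effect gradient descent is meant to produce on each training example.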


[Figure 2 layers, left to right: Convolution, Pooling, Convolution, Pooling, Fully Connected, Fully Connected, Output predictions] 

Figure 2. Convolutional Neural Network [16]. 


Building a CNN is done in two stages: the training stage and the testing stage. During the 
training stage, a set of labelled data needs to be provided. This stage involves the convolution process, 
subsampling, and multiplying the outcome of the subsampling by the artificial neural weights 
that are adjusted during training. In the testing stage, input data is applied to the trained model to test it [10]. 


3. METHODOLOGY 

Figure 3 shows the methodology used for this work. First, the dataset is prepared. It then goes 
through pre-processing, labelling and conversion to numerical representation. After these steps 
are done, the data is ready to be used to train and test the CNN model. Lastly, the performance of the sentiment 
analysis task is analyzed. 


Figure 3. Data processing 


3.1. Proposed CNN architecture 

The proposed architecture shown in Figure 4 and Table 1 is the CNN architecture adopted from [20] 
with the addition of a dropout layer to reduce the overfitting problem. 
1. Each word is mapped to a word vector representation 
— The entire tweet can be mapped to a matrix of size s x d, where s = number of words and d = 
dimension of the embedding space. 
2. Zero padding 
— To make sure all tweets have the same matrix dimension. 
3. Max pooling layer 
— Extracts the most important feature for each convolution. 
— Combines all the cmax of each filter into one vector. 
4. Drop out layer 
— Explicitly alters the network architecture at training time. 
5. Softmax layer 
— Gives out the final classification probabilities. 


Figure 4. Proposed CNN architecture 


Table 1. Baseline CNN configuration. 


Parameter Value 
Filter Region 2354 
Feature Map 100 
Pooling Layer Max Pooling 
Activation Function ReLU 
Drop Out Layer Rate 0.5 
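
The data flow through the proposed architecture (zero padding, convolution with ReLU, max pooling, dropout, softmax) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Keras implementation used in this work: the embedding dimension, the two toy filters, the random weights, and every function name are hypothetical, and the convolution step is inferred from the reference to pooling "each convolution".

```python
import math
import random


def pad(matrix, length, dim):
    """Zero-pad (or truncate) a list of word vectors to a fixed number of rows."""
    rows = matrix[:length]
    return rows + [[0.0] * dim for _ in range(length - len(rows))]


def conv_feature_map(matrix, filt, bias=0.0):
    """Slide an (h x d) filter over the (s x d) sentence matrix and apply ReLU."""
    h = len(filt)
    feature_map = []
    for i in range(len(matrix) - h + 1):
        total = bias
        for r in range(h):
            total += sum(w * x for w, x in zip(filt[r], matrix[i + r]))
        feature_map.append(max(0.0, total))  # ReLU activation
    return feature_map


def dropout(vector, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(matrix, filters, out_weights, training=False, rate=0.5, rng=None):
    # Max pooling keeps the largest activation (cmax) of each filter's feature map.
    pooled = [max(conv_feature_map(matrix, f)) for f in filters]
    if training:  # dropout is only active at training time
        pooled = dropout(pooled, rate, rng or random.Random(0))
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in out_weights]
    return softmax(scores)


rng = random.Random(42)
dim, length = 2, 5                                  # toy sizes (assumption)
tweet = [[0.1, 0.3], [0.7, -0.2], [0.4, 0.9]]       # three hypothetical word vectors
matrix = pad(tweet, length, dim)
filters = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)] for h in (2, 3)]
out_weights = [[rng.uniform(-1, 1) for _ in filters] for _ in range(2)]
print(forward(matrix, filters, out_weights))
```

Each filter contributes exactly one pooled value, so with 100 feature maps per filter region the pooled vector grows accordingly while the rest of the pipeline stays the same.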


3.2. Data pre-processing 

Raw text comes in a variety of forms, from individual words to sentences and multiple 
paragraphs, often with special characters and a lot of redundant information. In our case, the raw data 
consists of raw tweets. These raw texts need to be cleaned up so that they work well with the machine learning 
algorithm. The proposed data pre-processing steps are as follows: 
1. Split hashtags into words 
2. Capitalization 
3. Remove numbers, HTML tags, and URLs 
4. Stemming, lemmatizing (normalization) 
5. Stop word removal 
6. Tokenization 
7. Data annotation/labelling (positive, negative, neutral) 
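
Most of the listed steps can be sketched with regular expressions from the Python standard library. The example below is an illustration only: the stop word list is a tiny hypothetical sample, URLs and HTML tags are removed first so that `#` fragments inside links are not mistaken for hashtags, and stemming/lemmatization and labelling are omitted because they need language-specific resources.

```python
import re

# Hypothetical, deliberately tiny Malay stop word list (assumption).
STOP_WORDS = {"yang", "dan", "di", "ke", "ini", "itu"}


def split_hashtag(match):
    """Turn '#SelamatPagi' into 'Selamat Pagi' by splitting on capital letters."""
    return " ".join(re.findall(r"[A-Z]?[a-z]+|[A-Z]+", match.group(1)))


def preprocess(tweet):
    text = re.sub(r"https?://\S+", " ", tweet)      # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    text = re.sub(r"#(\w+)", split_hashtag, text)   # split hashtags into words
    text = text.lower()                             # capitalization: lower-case everything
    text = re.sub(r"\d+", " ", text)                # remove numbers
    tokens = re.findall(r"[a-z]+", text)            # tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal


print(preprocess("Saya suka #SelamatPagi di KL 2020 <b>wow</b> https://t.co/abc123"))
```

Hashtag splitting has to run before lower-casing, since the capital letters are the only cue for where one word ends and the next begins.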


3.3. Text to numerical representation 

Before the data is passed to the machine learning algorithm, the text is 
converted to numerical format using Word2Vec. Word2Vec is a two-layer neural net that processes text. 
Its input is a text corpus and its output is a set of vectors: feature vectors for the words in that corpus. 
While Word2Vec is not a deep neural network, it turns text into a numerical form that deep nets 
can understand. The purpose and usefulness of Word2Vec is to group the vectors of similar words in vector 
space. That is, it detects similarities mathematically. Word2Vec creates vectors that are distributed numerical 
representations of word features, such as the context of individual words. 
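
Similarity between word vectors is commonly measured with cosine similarity. The sketch below ranks the neighbours of a word in a toy embedding table; the three-dimensional vectors and the helper name `related_terms` are invented for illustration and are not taken from the trained model.

```python
import math

# Hypothetical 3-dimensional embeddings (assumption); trained Word2Vec vectors
# usually have a hundred or more dimensions.
EMBEDDINGS = {
    "cute": [0.9, 0.8, 0.1],
    "comel": [0.85, 0.75, 0.2],
    "funny": [0.7, 0.4, 0.3],
    "traffic": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_terms(word, embeddings=EMBEDDINGS, top_n=3):
    """Return the top_n other words ranked by cosine similarity to `word`."""
    target = embeddings[word]
    scored = [(other, cosine(target, vec)) for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


for term, score in related_terms("cute"):
    print(term, round(score, 3))
```

Words whose vectors point in nearly the same direction score close to 1, while unrelated words score much lower.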


3.4. Training strategy 

After the data pre-processing steps and the model are defined, the next step is to determine the training 
strategy. A Graphics Processing Unit (GPU) is chosen to train the model. Using a GPU 
saves a substantial amount of time, since a GPU can run a higher number of parallel tasks than a 
Central Processing Unit (CPU) due to its higher number of cores. At the software level, the Keras package 
implements the training procedure [21]. 

Another important element of deep learning is the data set used to train the model. In this study, 
two different data sets were used separately to train the English SA model and the Malay SA model. 
For the English SA model, the sentiment140 data set was used [22]. This data set contains about 1.6 million 
English tweets with positive and negative labels. It is widely used for research purposes. For the Malay SA 
model, the MALAYA data set was used [23]. It contains about 6 million Malay tweets with positive and 
negative labels. 

For each data set, 80% of the data is used as training data, 
while the remaining 20% is split in half, with one half used as validation data and the other half as testing data. 
In addition, the training is configured to run for 100 epochs. 
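
The 80/10/10 split can be expressed in a few lines. The sketch below is one straightforward way to do it; the function name `split_dataset`, the fixed shuffle seed, and the integer stand-ins for tweets are assumptions for illustration.

```python
import random


def split_dataset(samples, train_frac=0.8, seed=0):
    """Shuffle, keep train_frac for training, and halve the rest into validation and test."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * train_frac)
    n_val = (len(items) - n_train) // 2
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

Shuffling before splitting avoids any ordering in the source file (for example, all positive tweets first) leaking into one of the subsets.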


4. RESULTS AND ANALYSIS 
4.1. Text to numerical representation result 

This section reports the results from the Word2Vec numeric conversion. After the Word2Vec model 
is trained, the weights of the words (numeric format) are fixed and are not changed. The trained 
Word2Vec model becomes one of the layers in the final CNN sentiment analysis model. The main purpose 
of Word2Vec is to group the vectors of similar words in vector space. As shown in Figure 5, 
Word2Vec gives the mathematical similarities of the top 20 most similar terms for different input words. 


[Figure 5 panels: get_related_terms(u'cute'), get_related_terms(u'ayang'), get_related_terms(u'malaysia'). Legible entries for 'cute': funny 0.579, classy 0.431, gorgeous 0.428, attractive 0.428, sexy 0.423] 


Figure 5. Word2Vec similarities results 


4.2. CNN model hyper-parameter tuning 

Hyper-parameters are the settings that can be tuned to optimize a deep learning model. 
In this project, a few hyper-parameters were selected and tuned: the filter region size, 
the number of feature maps, and the drop out layer rate. Tables 2-6 report the various adjustments made to the architecture 
during training. 


Table 2. Results of different single filter region sizes 
Iteration   Region Size   Accuracy 
T1          1             76.00% 
T2          3             77.30% 
T3          5             76.93% 
T4          7             76.52% 
T5          10            76.42% 
T6          15            76.68% 
T7          20            76.91% 
T8          30            76.58% 


Table 3. Results of different multi-filter region sizes 
Iteration   Multi-Region Size   Accuracy 
T1          2,3,4               76.56% 
T2          3,4,5               76.85% 
T3          4,5,6               76.74% 
T4          5,6,7               76.09% 
T5          7,8,9               76.59% 
T6          3,4,5,6             76.85% 
T7          6,7,8,9             76.80% 
T8          3,3,3               77.37% 
T9          3,3,3,3             76.48% 


Table 4. Results of different feature maps 
Iteration   Feature Maps   Accuracy 
T1          10             76.35% 
T2          50             77.12% 
T3          100            77.37% 
T4          200            77.59% 
T5          400            77.45% 
T6          600            77.56% 
T7          1000           77.26% 
T8          2000           77.37% 


Table 5. Results of different drop out rates 
Iteration   Drop Out Rate   Accuracy 
T1          0.1             76.17% 
T2          0.2             71.37% 
T3          0.3             71.17% 
T4          0.4             77.34% 
T5          0.5             77.59% 
T6          0.6             77.40% 
T7          0.7             76.99% 
T8          0.8             76.91% 
T9          0.9             77.44% 


After many iterations of hyper-parameter tuning, the best CNN model configuration 
obtained is shown in Table 6. 


Table 6. Best configuration after hyper-parameter tuning 


Parameter Value 
Filter Region 3,3,3 
Feature Map 200 
Pooling Layer Max Pooling 
Activation Function ReLU 
Drop Out Layer Rate 0.5 
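
The tuning procedure, in which one hyper-parameter is varied while the others are held fixed, can be sketched as a simple sweep. In the example below, `evaluate` is a stand-in that looks up clearly legible accuracies from Table 4 instead of training a network; the function names and the configuration keys are assumptions for illustration.

```python
def sweep(base_config, name, candidates, evaluate):
    """Vary one hyper-parameter, keep the rest fixed, and return the best value with all scores."""
    results = {}
    for value in candidates:
        config = dict(base_config, **{name: value})
        results[value] = evaluate(config)
    best = max(results, key=results.get)
    return best, results


# Stand-in for a full training run (assumption): accuracies for feature maps from Table 4.
TABLE_4 = {10: 76.35, 100: 77.37, 200: 77.59, 600: 77.56, 1000: 77.26, 2000: 77.37}


def evaluate(config):
    return TABLE_4[config["feature_maps"]]


base = {"filter_regions": (3, 3, 3), "feature_maps": 100, "dropout_rate": 0.5}
best, results = sweep(base, "feature_maps", sorted(TABLE_4), evaluate)
print(best, results[best])
```

Tuning one parameter at a time is cheaper than a full grid search, at the cost of possibly missing interactions between parameters.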


4.3. Model performance and accuracy 

After tuning all the parameters, the English and Malay sentiment analysis CNN models were 
trained and the final accuracy results were obtained. The results of our models against other methods are 
listed in Table 7. 


Table 7. Accuracy comparison with published works. 


Model Accuracy 
SVM [24] 75.86% 
Naive Bayes [24] 77.45% 
Indonesian CNN without Normalizer [24] 69.92% 
Indonesian CNN with Normalizer [24] 65.45% 
Indonesian LSTM with Normalizer [24] 73.22% 
This work English SA CNN 81.87% 
This work Malay SA CNN 77.59% 


Very limited work on Malay sentiment analysis has been done using deep learning approaches. Therefore, 
the comparison can only be made against a similar work in Bahasa Indonesia [24]. There is another work 
done in Bahasa Indonesia which achieved 100% accuracy, but its dataset was too small (fewer than 300 samples) [25]. 
Although this comparison has some limitations, it can serve as an indicator that 
provides a general view of the current standing of this project. 

The accuracy obtained from our models is comparable to, and slightly better than, that of [24]. From the table, 
SVM and Naive Bayes are the machine learning approaches with accuracies close to the results obtained in 
this work. However, deep learning approaches have a significant advantage due to their data-driven 
nature, meaning that the more data is fed in, the higher the accuracy. Other machine 
learning algorithms, on the other hand, hardly improve their accuracy after reaching a certain value. 


5. CONCLUSION 

In conclusion, the deep learning approach has been successfully applied to the task of Malay 
sentiment analysis. The developed sentiment analysis model is able to classify tweet text into two 
sentiment categories, positive and negative. The deep learning architecture used to build 
the Malay sentiment analysis model is based on the Convolutional Neural Network (CNN). To feed the 
text data to the CNN, a Word2Vec word embedding model was built from scratch 
to convert the text input to numerical representation. The model also went through hyper-parameter 
tuning to achieve its optimal performance. The Malay sentiment analysis CNN model has been 
successfully built and validated to achieve an accuracy of up to 77.59%. 

Some potential enhancements and modifications can be made to the current design. Firstly, 
the current design uses static word embedding, which means that after the word embedding model is trained, 
the weights for the words (numerical representation) are fixed. Dynamic word embedding is a possible 
enhancement: the weights of the word embedding model can be tuned during 
CNN training. Lastly, different text pre-processing techniques can be explored to enhance the 
model. For text classification, text pre-processing is as important as building the deep learning model; 
a small modification at this stage can strongly influence the outcome. 